ABSTRACT

In 2004, the SEED (http://pubseed.theseed.org/) was 
created to provide consistent and accurate genome 
annotations across thousands of genomes and as a 
platform for discovering and developing de novo an- 
notations. The SEED is a constantly updated inte- 
gration of genomic data with a genome database, 
web front end, API and server scripts. It is used by 
many scientists for predicting gene functions and 
discovering new pathways. In addition to being a 
powerful database for bioinformatics research, the 
SEED also houses subsystems (collections of func- 
tionally related protein families) and their derived 
FIGfams (protein families), which represent the 
core of the RAST annotation engine
(http://rast.nmpdr.org/). When a new genome is submitted to
RAST, genes are called and their annotations are 
made by comparison to the FIGfam collection. If 
the genome is made public, it is then housed 
within the SEED and its proteins populate the 
FIGfam collection. This annotation cycle has 
proven to be a robust and scalable solution to the 
problem of annotating the exponentially increasing 
number of genomes. To date, >12 000 users worldwide
have annotated >60 000 distinct genomes
using RAST. Here we describe the interconnected- 
ness of the SEED database and RAST, the RAST 
annotation pipeline and updates to both resources. 



INTRODUCTION 

In the mid-1990s, entire bacterial and archaeal
genomes began to be sequenced. These early
sequencing projects were large undertakings, fraught 
with technical challenges and requiring thousands of 
man-hours to complete. Major obstacles resulted from 
limitations in sequencing technology and the onerous 
task of determining the functions of each gene. Early on, 
genome annotation was largely a by-hand effort, and it 
could take an individual researcher several months to 
annotate a single megabase of DNA (1,2). Depending on 
the organism, the end result was a somewhat dissatisfying 
reflection of the current knowledge of the field. For 
instance, at the time only 62% of the genes in 
Escherichia coli K-12 could be assigned a functional role 
(3). In organisms that were not as well studied this number 
was far worse; for instance, only 38% for the archaeon 
Methanocaldococcus jannaschii (4). In the past 16 years 
these numbers have improved with >90% of the genes 
in E. coli K-12 and ~70% of the genes in M. jannaschii 
having a known functional role (5-7). These gains have 
been achieved through direct research on these organisms 
and the integration of data from research on other 
organisms. 

From its inception in 2004, the goal of the SEED 
project has been to integrate annotations from a wide 
variety of sources and to use them to improve our know- 
ledge about microbial genomes (5). Many scientists are 
experts in a circumscribed area of physiology or metabol- 
ism. By capturing information from individual scientists in 
annotated subsystems, we leverage their expertise in the
annotation and analysis of all microbial genomes, not just 
the few model systems that are well studied. Thus, each
genome draws on the expertise of a wide range of biologists,
expertise that would have gone unused had individual
genomes been annotated one by one. The initial investment
in manual curation by skilled biologists building
subsystems that include all available genomes has now 
formed the basis of many thousands of automated anno- 
tations at high levels of accuracy. We believe that auto- 
mated annotation systems, like the one used by the SEED, 
will ultimately reach the point where they can match the 
performance of the most skilled human annotators; and 
they will reach this point via incremental improvements 
where limited amounts of manual annotation play a 
central role. 



THE SEED 

The SEED continually integrates different types of 
genomic data from a variety of sources. These include 
public genomes annotated by RAST (8), expert user an- 
notations, metabolic modeling data (9,10), expression 
data, literature references verifying annotations (11) and 
links to data from other popular resources including 
Swiss-Prot (12), GenBank (13), IMG (14), KEGG (15), 
CDD (16) and so forth. These data are made accessible 
primarily in two ways: through web access (5) and high- 
performance computing servers that are accessible pro- 
grammatically via an API and server scripts (17) (tutorials 
are available at http://www.theseed.org/). 

The SEED Web site (SEED viewer) 

The SEED Web site presents a rich environment for 
genome annotation and comparison. Inspired by the 
Google search page, the SEED start page also has a
single window, which allows the user to search for a 
genome of interest, a gene, a protein, a feature or a func- 
tional role. The same page provides dropdown menus for 
other entries into the SEED Viewer environment. 
Registration with the SEED is required only for users who
would like to make changes to the database. For each
protein in a genome, the SEED Web site offers a protein 
page that contains direct links to the NCBI CDD database 
(16), the KEGG Enzyme database (15) and PubMed ID 
links to articles describing the functional role of a given 
gene product (11) (15 565 links). Perhaps the most popular
tool on the SEED Web site is the 'Compare Regions 
View', which is an integral part of each protein page. 
This tool allows users to compare the genomic neighbor- 
hood of a given gene across genomes. The user has the 
ability to set the number of genomes that the gene of 
interest is compared with, the similarity threshold for in- 
clusion in the comparison, the coloring of genes based on 
similarity and the size of the region being displayed. This 
tool provides a powerful means for finding and correcting 
gene calls and for predicting new functions based on 
conserved genomic context (Figure 1). Many protein 
pages now have links to pre-computed alignments and 
trees. For some of the SEED organisms, the protein page
also has links to expression data that have been
preprocessed to present 'Atomic Regulons', sets of
co-expressed genes. Information of this kind is invaluable
when disambiguating the products of paralogous 
genes (18). 

The SEED and RAST Web sites support a multitude of 
comparative genomics tools. For example, as shown in 
Figure 2, users can readily identify insertions and deletions 
in up to nine target genomes compared with one reference 
genome using the 'Sequence Based Comparison Tool'. 
The tool colors each gene based on protein similarity 
using BLAST (19), and each gene is marked as being 
unique, a unidirectional best hit or a bidirectional best 
hit in comparison to the reference genome. The output 
also includes a whole-genome schematic colored by 
BLAST similarity and BLAST dot-plots between 
compared organisms. The resulting data table can also 
be downloaded for further analysis. Like the 'Sequence 
Based Comparison Tool', the 'Function Based 
Comparison Tool' compares two genomes to assess 
similarities and differences in the presence of functional 
roles that have been linked to subsystems. This enables the 
user to view unique functions found in either genome. 
Results of this analysis can also be downloaded for 
further study. 
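
As an illustration only, the three-way marking used by the 'Sequence Based Comparison Tool' can be sketched in Python. The function name, the gene identifiers and the two best-hit dictionaries are hypothetical; the sketch assumes that best hits in each direction have already been computed (for example with BLAST) and does not reproduce the tool's actual implementation or thresholds.

```python
from typing import Dict, List


def classify_reference_genes(
    ref_genes: List[str],
    ref_best: Dict[str, str],
    cmp_best: Dict[str, str],
) -> Dict[str, str]:
    """Label each reference gene relative to one comparison genome.

    ref_best maps a reference gene to its best hit in the comparison
    genome (genes without an acceptable hit are absent); cmp_best maps a
    comparison gene to its best hit in the reference genome.
    """
    labels: Dict[str, str] = {}
    for gene in ref_genes:
        hit = ref_best.get(gene)
        if hit is None:
            labels[gene] = "unique"          # no counterpart found
        elif cmp_best.get(hit) == gene:
            labels[gene] = "bidirectional"   # each is the other's best hit
        else:
            labels[gene] = "unidirectional"  # best hit is not reciprocated
    return labels


if __name__ == "__main__":
    # Made-up identifiers: r* are reference genes, c* are comparison genes.
    example = classify_reference_genes(
        ["r1", "r2", "r3"],
        {"r1": "c1", "r2": "c1"},
        {"c1": "r1"},
    )
    for gene, label in example.items():
        print(gene, label)
```

In this toy input, r1 and c1 are reciprocal best hits, r2 hits c1 without being hit back, and r3 has no hit at all.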

The SEED Web site also allows users to browse the 
current collection of subsystems, which are proteins 
grouped by a relationship in function (5). For instance,
the subsystem 'tRNA aminoacylation Phe' includes the 
functional roles, 'Phenylalanyl-tRNA synthetase alpha 
chain (EC 6.1.1.20)' and 'Phenylalanyl-tRNA synthetase 
beta chain (EC 6.1.1.20)'. The subsystem spreadsheet is 
populated with all genomes that have those functional 
roles and provides links to the relevant protein pages. 
The subsystem info tab provides an expert annotator's 
notes on the creation of the subsystem. Although they 
are not comprehensive, the SEED subsystems are a par- 
ticularly useful way to quickly determine the proteins that 
are involved in a related function and to determine known 
variations in functionality between organisms. Experts in 
areas of microbial biochemistry and physiology are 
encouraged to annotate genes on the public version of
the SEED (http://pubseed.theseed.org), so that their 
knowledge can be propagated to the scientific community. 

Programmatic access to SEED data 

A network-based API allows programmatic access to all 
of the data that exist within the SEED (17). A compre- 
hensive set of tutorials for accessing data and the software 
necessary to interact with the SEED servers can be found 
here (http://www.theseed.org/servers/). SEED data can be 
accessed via four different servers: the Sapling server 
contains genomic data, the ANNO server supports 
capabilities relating to annotation, the RAST server 
enables batch submission to RAST and the Model 
server provides access to metabolic modeling data 
underlying the Model SEED (9). 

Figure 1. The 'Compare Regions' tool in the SEED. The Staphylococcal SCCmec element is shown as an example. Re-arrangements within
Staphylococcal SCCmec element lead to constitutive expression of resistance determinant MecA due to (partial) deletion of repressor MecI and/or
sensor-transducer MecR. Homologous genes are presented as arrows with matching colors and numbers. Genes not conserved within the
displayed region are gray. The graphic is centered on the focus gene (red, #1): Methicillin resistance determinant MecA; green, #8: Methicillin
resistance regulatory sensor-transducer MecR1; blue, #18: Methicillin resistance repressor MecI; green, #2: transposase for IS431.

Figure 2. Circle plot showing the comparison of eight Brucella
genomes relative to a user-defined reference genome. The zoomed
regions highlight insertions/deletions (colored versus white) and
changes in conservation relative to the reference genome (going from
blue representing the highest protein sequence similarity to red
representing the lowest).

As most of the API access routines are used repeatedly
and writing new code can be labor intensive, the SEED
also offers a large repository of >150 server scripts
(http://pubseed.theseed.org/sapling/server.cgi?pod=ServerScripts).
Each server script is a small program that
accesses the SEED servers from the command line.
These server scripts perform a multitude of common
tasks. For example, 'svr_all_genomes' will return the scientific
name and genome identifier for every genome in
SEED, and 'svr_function_of' returns the functional role
for a given protein identifier. The server scripts can be
piped together to create a powerful suite of bioinformatics
tools, yet require little programming knowledge to use.
The SEED server scripts are distributed as part of the
myRAST installation (described later in text).
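
To illustrate why small scripts that can be piped together compose so readily, the following Python sketch chains two filter stages in the manner of a command-line pipeline. It does not contact the SEED servers and does not reproduce any server script: the lookup table, the protein identifiers and both function names are made up, and the convention that each stage reads tab-delimited lines and appends a column is an assumption made for this example.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical stand-in for a server lookup; a real server script would
# query the SEED servers instead of a local dictionary.
TOY_FUNCTIONS: Dict[str, str] = {
    "fig|1.1.peg.1": "Phenylalanyl-tRNA synthetase alpha chain (EC 6.1.1.20)",
    "fig|1.1.peg.2": "Phenylalanyl-tRNA synthetase beta chain (EC 6.1.1.20)",
}


def append_function(lines: Iterable[str], functions: Dict[str, str]) -> Iterator[str]:
    """Treat the last tab-separated field as a protein ID and append its role."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        role = functions.get(fields[-1], "")  # empty column when unknown
        yield "\t".join(fields + [role])


def keep_matching(lines: Iterable[str], text: str) -> Iterator[str]:
    """Pass through only the lines that contain the given text."""
    for line in lines:
        if text in line:
            yield line


if __name__ == "__main__":
    ids = ["fig|1.1.peg.1", "fig|1.1.peg.2", "fig|1.1.peg.3"]
    # Equivalent in spirit to: producer | append_function | keep_matching
    for row in keep_matching(append_function(ids, TOY_FUNCTIONS), "beta chain"):
        print(row)
```

Because every stage consumes and produces the same line-oriented format, stages can be reordered or extended without changing the others.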

SEED-supported resources 

The use of a standard vocabulary and continual improve- 
ment of genome annotations coupled with a robust 
database structure has made the SEED project an attract- 
ive venue for several productive collaborations (Table 1). 
The SEED currently offers data supporting NMPDR, the 
National Microbial Pathogen Data Resource (unfunded, 
Web site operational) (20); PATRIC, the Pathosystems 
Resource Integration Center, the all-bacterial BRC
(Bioinformatics Resource Center) (http://www.patricbrc.org)
(21); PhAnToMe, Phage Annotation Tools and
Methods (http://www.phantome.org) (unfunded, Web 
site operational); Model SEED (9) and the U.S. 
Department of Energy KBase project (in progress). 

RAST 

RAST, Rapid Annotations using Subsystems Technology 
(8), is an automatic annotation server for microbial 
genomes, built upon the framework provided by the 
SEED system. A new user must register for the service, 
which involves giving us contact information and 
acquiring a password. By registering users, we can create 
a framework in which users have access to only those 
genomes that they have submitted. It also allows us to
contact the user once the automatic annotation has 
finished or in case user intervention is required. RAST is 
designed to consistently produce annotations comparable 
in quality to those produced by the best human annotators 
and to extend those annotations to as many protein- 
encoding genes in as many genomes as possible. 
Continuous addition of new subsystems that cover
previously un-annotated regions of the genomes, and continuous
quality control of existing subsystems are central
to improved annotations in the SEED and their propagation
via FIGfams into RAST (5,8,10,22,23). RAST-annotated
public genomes are then introduced into the
SEED and included in the SEED curation. The SEED
=> FIGfam => RAST => SEED cycle is at the heart
of SEED-based annotations.

Table 2. Major milestones and improvements in the RAST system over the past 5 years

[...]                  [...]                                 500-1000 genomes/day
Maximum throughput     300 genomes/day                       1000 genomes/day
[...]                  [...]                                 1600
Data types accepted    Complete genomes                      Phages, plasmids, draft genomes, complete genomes
Formats accepted       FASTA                                 FASTA, GenBank
Submissions            Single, web-based submissions only    Web submissions and batch submissions
ORF calling            Glimmer2                              Glimmer3, RAST, user provided ORF calls

RAST was introduced in 2007, and concomitant with 
the plummeting cost of DNA sequencing, we have seen 
the number of genomes annotated by RAST increase by 
>2 orders of magnitude, from 350 genomes in the initial 
release to >60 000 distinct genomes (>100 000 jobs
submitted) annotated to date (Table 2). Although the 
number of jobs continues to grow (Figure 3), the 
average time to compute a job has decreased slightly 
over the years (data not shown) as faster computers
have been deployed in our infrastructure and improvements
to our algorithms have been incorporated into our code base.
Currently, the RAST server is used routinely to anno- 
tate 200-300 prokaryotic genomes daily (up to 700 
at peak loads), of which over two-thirds are unique 
and >1 Mb long. In the next 5 years, we antici- 
pate annotating hundreds of thousands of microbial 
genomes. 

All of the nearly 12 000 bacterial genomes available 
from PATRIC have been consistently annotated using 
RAST. PATRIC provides researchers with a resource 
that stores and integrates a variety of data types 
(genomic, transcriptomic, protein-protein interactions,
3D protein structures and sequence typing data) with 
their associated metadata. Data are summarized at the 
level of the individual genome and across taxonomic 
levels (21). PATRIC also allows researchers to compare 
RAST annotations with those from other sources, most 
notably annotations from GenBank/RefSeq. 

Figure 4 shows the genomes annotated by RAST for 
PATRIC displayed on a taxonomy-based tree for the 
orders in the bacteria and archaea (24). All of those 
genomes (unlike other RAST annotated genomes) are 
public. They can be used to visualize the great diversity 
of genomes that have been annotated by RAST. 



The RAST pipeline 

The RAST pipeline implements the following steps to 
annotate a prokaryotic genome: 

(1) Identify the selenoproteins and pyrrolysoproteins. 
These special case genes are sought using custom 
algorithms. There is a growing set of such special 
cases where domain-specific knowledge is required 
to recognize the genes and most alignment 
programs such as BLAST are not sensitive 
enough to discriminate between the special-case 
genes and the similar but non-special-case genes.

(2) Generate an estimate of the 30 closest phylogenetic 
neighbors in the SEED by comparing ab initio 
GLIMMER3 gene-candidates with a set of univer- 
sal proteins plus up to 200 'unduplicated' proteins 
(26). These gene candidates are only used to 
identify the phylogenetic neighborhood and to 
help 'bootstrap' iterative retraining of 
GLIMMER3 and are not retained in the final
annotation. 

(3) Identify the tRNA and rRNA genes using 
'search_for_rnas' (Niels Larsen, unpublished, available
from the author on request), which uses
tRNAscan-SE to find tRNAs (27) and BLASTN 
(19) against a set of RNA databases followed by 
endpoint adjustment to find rRNAs. 

(4) Test all of the gene candidates from step 2 to
identify those that are similar to proteins in subsystems
using signature amino-acid k-mers (sets of
eight sequential amino acids). The k-mers allow us
to rapidly scan the gene candidates against all
known proteins, as we have described for
metagenomes elsewhere (28). Candidates having k-mer
evidence for a subsystem-based function are
'promoted' to the status of 'protein-encoding
gene' (PEG), and assigned ≥1 functional roles
based on that k-mer evidence.

(5) Iteratively retrain GLIMMER3 on the set of gene
candidates validated by k-mers in step 4. Steps 4
and 5 are repeated until no new gene candidates
are found that are similar to those in subsystems.
Gene candidates are only retained if they match a
gene in a subsystem and do not significantly
overlap a gene that was called previously. In
practice, convergence is usually achieved after
three iterations and 'overtraining' is not observed.

Figure 3. Number of users (open squares) and number of jobs (closed circles) in the RAST system. As of September 2013, there were over 100 000
jobs processed by RAST and >12 000 active users of the system.

(6) Any remaining gene candidates that do not signifi- 
cantly overlap an existing gene call are included if 
they are similar to any protein in the 30 closest 
neighbors using BLASTP (19). 

(7) Any remaining gene candidates that do not signifi- 
cantly overlap an existing gene call are included. 

(8) Gene fragments that may contain frameshifts due 
to low-quality sequencing are detected by 
comparing with the template genes in the 30 
nearest neighbors. If requested by the user, these 
gene fragments are joined to a single gene, and 
detailed statements of what was inferred and why 
are recorded. 

(9) Any DNA stretches longer than 1500 bp that do 
not contain a gene are 'backfilled' with gene candi- 
dates by comparing them with the proteins from 
the 30 nearest neighbors using BLASTX (19). 

(10) Functions are assigned to products of genes 
without k-mer-based assignments by using
BLASTP similarities. 

(11) If a gene candidate has not been assigned a subsys- 
tem-based functional role, and it has flanking genes 
with subsystem-based functional roles, then it is 
compared with the nearest neighbors from step 2. 
If all three genes are bidirectional best hits (BBHs) 
to the corresponding set of three genes in a neigh- 
boring genome, then the current assignment is 
replaced by the subsystem-based functional role 
from the neighboring genome. 

(12) Missed genes are identified by examining remaining 
gaps flanked by genes that are BBHs to genes that 
are in subsystems in a neighboring genome. 

(13) Gene candidates that do not have subsystem or 
BLAST support and are embedded within another 
gene, significantly overlap a gene or are extremely 
short (<90 nt) are removed.



(14) Subsystem analyses and initial metabolic recon- 
structions are performed. The subsystems analysis 
calculates which subsystems are reflected in the 
genome, and for each subsystem estimates the 
most likely variant. The metabolic reconstruction 
connects the annotations to the metabolic model 
in preparation for flux balance analyses in the 
Model SEED (9). 

(15) Pairs of close bidirectional best hits (PCBBHs) are 
computed against genomes in the PubSEED. These 
support estimates of functional coupling based on 
conserved contiguity (29,30). 

(16) Genome data are exported in GenBank, EMBL, 
GFF3, GTF, Excel and tab-delimited formats. 
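
The signature k-mer idea in step 4 can be illustrated with a minimal Python sketch. Only the k-mer length of eight amino acids comes from the description above; treating a k-mer as a 'signature' when it is seen under exactly one functional role, the simple vote count and the min_hits threshold are assumptions made for this example, as are all function names and sequences. It is not the RAST implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple

K = 8  # k-mer length stated in step 4


def kmers(seq: str, k: int = K) -> Iterator[str]:
    """Yield every overlapping k-mer of a protein sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_signature_index(
    labelled: Iterable[Tuple[str, str]], k: int = K
) -> Dict[str, str]:
    """Map each k-mer observed under exactly one role to that role.

    'labelled' holds (protein sequence, functional role) pairs. K-mers
    shared between roles are discarded (an assumed definition of signature).
    """
    roles_by_kmer: Dict[str, Set[str]] = defaultdict(set)
    for seq, role in labelled:
        for km in kmers(seq, k):
            roles_by_kmer[km].add(role)
    return {
        km: next(iter(roles))
        for km, roles in roles_by_kmer.items()
        if len(roles) == 1
    }


def call_function(
    candidate: str, index: Dict[str, str], min_hits: int = 2, k: int = K
) -> Optional[str]:
    """Return the role with the most signature k-mer hits, or None.

    min_hits is a hypothetical threshold, not a documented RAST parameter.
    """
    votes = Counter(index[km] for km in kmers(candidate, k) if km in index)
    if not votes:
        return None
    role, hits = votes.most_common(1)[0]
    return role if hits >= min_hits else None


if __name__ == "__main__":
    # Made-up sequences and role names for demonstration.
    index = build_signature_index([
        ("MKTAYIAKQRQISFV", "role A"),
        ("MSTNPKPQRKTKRNT", "role B"),
    ])
    print(call_function("GGMKTAYIAKQRGG", index))
    print(call_function("AAAAAAAAAA", index))
```

Because the lookup is an exact dictionary match rather than an alignment, each candidate is scanned in time proportional to its length, which is what makes scanning against all known proteins fast.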

Owing to its popularity, RAST has often been used in
attempts to annotate chunks of DNA that were not
contigs from prokaryotic genomes. Because of the iterative
approach of the annotation algorithm and the reliance on 
closely related genomes, RAST is not able to annotate 
mixed sequences (e.g. mixed culture genomes, 
metagenomes). However, we have adapted the RAST 
pipeline to annotate phage and plasmid genomes, which 
often have close homologs. The phage/plasmid pipeline 
(invoked automatically for submissions of <100 kb in all
contigs) involves finding the RNAs and close neighbors 
using the pipeline described earlier in text, but substituting 
MGA (31) for GLIMMER3 in the initial gene calling step. 
Step 5 of the pipeline, the iterative gene calling, is only run 
once, and all candidate genes are accepted. All subsystems 
are used to annotate the phage genes, but the ~50 phage- 
specific subsystems introduced by the PhAnToMe project 
(http://www.phantome.org/) enhance the quality of 
phage-specific genome annotations. The pipeline then 
skips forward to Step 8, identifying and repairing frameshifts,
and the rest of the pipeline continues as described.

Manual improvements to RAST-annotated genomes 

The RAST user interface (derived from the SEED interface)
allows registered users to make manual changes to
their genomes before retrieving them. The user can elect to
delete or add gene calls, adjust start positions for genes,
change functional role annotations and re-compute the
subsystems asserted. Tutorials on manual annotation are
available from the RAST entry page.

Figure 4. Genomes processed by RAST displayed over a taxonomic tree. In all, 12 289 RAST annotated public genomes for PATRIC available on
the PubSEED were compared at the order level using the NCBI taxonomy (25). Black bars show the number of sequenced representatives per order.
White bars show those orders with no sequenced representatives. The tree was created using the Interactive Tree of Life (http://itol.embl.de/) and is
unrooted.

myRAST 

We have implemented several high-performance web 
services for computation against SEED data (17). These 
SEED web services may also be accessed via a standalone 
application called myRAST, a demonstration project built 
using SEED web service technology. myRAST supports 
automated and manual annotation of both genomic data 
and collections of metagenomic (DNA) data. Genomic
data are annotated using the SEED servers to identify
protein-encoding genes and RNA genes in a manner similar to the
RAST pipeline described earlier in text, and to annotate
the protein-encoding genes using the SEED k-mer-based
annotation algorithm (28). The annotated genomes are 
installed into a local (to the user's computer) relational 
database using the SEED ERDB technology. myRAST 
is freely available for download from the web at
(http://blog.theseed.org/servers/installation/distribution-of-the-seed-server-packages.html).
An article describing
myRAST in detail is in preparation. 

The myRAST application also computes an estimate of 
the genomes most closely related to the user's genome, and 
then computes a set of fairly conservative correspondences 
between the user's genome and each genome in this set.
These data are used to drive the myRAST compare 
regions viewer, which is similar to the compare regions 
viewers in the SEED and in RAST. 

myRAST may also be used to load and visualize the 
SNP analysis available in the SEED toolkit. Here, a set 
of user genomes is analyzed in comparison with a single 
reference genome. This analysis generates gene calls and 
annotations as propagated from the reference genome, as 
well as a set of SNPs occurring in both the genes and the 
intergenic regions. For each SNP the user may view the 
corresponding DNA or protein alignments. 

FUTURE DEVELOPMENTS 

Due to increasing demand, RAST will soon support 
annotating organisms from the same species using a ref- 
erence genome specified by the user. When specified, an 
attempt will be made to inherit all annotations from the 
reference genome and also propagate gene names. Because 
gene names are used inconsistently across species, neither 
the SEED nor RAST has ever attempted to propagate 
them (32). For example, the gene sirA of Salmonella is 
also known as uvrY in E. coli or gacA in Pseudomonas. 
Instead, the SEED and RAST attempt to consistently 
propagate subsystem-based functional roles. 

Performance in RAST is a constant issue, especially in 
the face of exponentially increased use. We have recently 
installed changes that allow us to process >700 jobs per 
day. Although we expect to improve performance further, 
our efforts are now largely directed at achieving improved 
accuracy (10,23). We are also planning to redesign the 
user interface for the SEED and RAST to accommodate 
the wealth of genomes. The community is constantly 
producing tools that recognize, and often characterize, 
specific classes of genome features. We are planning to 
add several more of these new specialized tools to our 
pipeline, such as the recognition of BOX elements in
Streptococci (33) and the identification of CRISPRs (34).

We intend to institute a 'Publish to PATRIC' button
that will allow users to immediately share their genomes 
publicly through the PATRIC portal. The PATRIC iden- 
tifier can then be used in publications to direct others to 
the annotated genome product. Genomes that have been
exported to PATRIC can then be explored and compared
with the wide suite of tools that PATRIC has to offer, and
their annotations can be compared with those from a
variety of sources.